feat: add ScalaUDF support via a codegen dispatcher #4267

Draft
mbutrovich wants to merge 52 commits into apache:main from mbutrovich:codegen_scala_udf

Conversation

@mbutrovich mbutrovich commented May 8, 2026

Which issue does this PR close?

Closes #.

Rationale for this change

#4232 merged the JVM UDF bridge. This PR adds a codegen dispatcher on top: a CometUDF (CometScalaUDFCodegen) that compiles a specialized batch kernel per bound ScalaUDF expression and input schema via Janino. Without this path, any plan containing a ScalaUDF falls back to Spark for the enclosing operator, losing native execution on the surrounding plan.

The dispatcher is one of potentially many CometUDF implementations the bridge can route to. Hand-written CometUDFs for specific expression families (e.g. regex in #4239, JSON in #4305) remain a parallel path; the bridge dispatches by class name from the proto and does not require everything to go through the dispatcher.

Benefits:

  • Any ScalaUDF whose argument and return types are in the supported surface routes through native without a hand-written CometUDF.
  • The dispatcher binds the entire ScalaUDF argument tree, so Catalyst sub-expressions inside the UDF (upper(s), concat(c1, c2), monotonically_increasing_id()) compile into the same per-row loop as the user function.
  • Surrounding native operators stay native; the UDF is no longer a whole-operator fallback boundary.
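To make the fused per-row loop concrete, here is a conceptual sketch in plain Java (no Arrow, illustrative names only, not the real generated code): the Catalyst sub-expression and the user function land in the same loop body.

```java
// Conceptual sketch of what the compiled kernel does for a UDF
// applied to upper(s): the sub-expression and the user function are
// fused into one per-row loop. KernelSketch and userFn are
// hypothetical stand-ins, not APIs from this PR.
public class KernelSketch {
    // Stand-in for the user's ScalaUDF body.
    static int userFn(String s) {
        return s.length();
    }

    // Stand-in for the compiled batch kernel over a non-nullable
    // column, so the isNullAt checks are elided as described above.
    public static int[] evalBatch(String[] input) {
        int[] out = new int[input.length];
        for (int i = 0; i < input.length; i++) {
            String upper = input[i].toUpperCase(); // inlined Catalyst sub-expression
            out[i] = userFn(upper);                // user function call
        }
        return out;
    }
}
```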

Gated by spark.comet.exec.scalaUDF.codegen.enabled (default: true). When disabled, plans containing a ScalaUDF fall back to Spark for that operator.

The CometUDF contract loosens from "should be stateless" to "may hold per-task state in fields." One instance per Spark task attempt per class, reused across all batches of the task, dropped on task completion. Per-instance access is single-threaded because Spark runs one native future per partition and Tokio polls one future per worker at a time.
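A minimal sketch of what the loosened contract permits, with a hypothetical class name: a field accumulates across batches of one task, with no synchronization needed because access is single-threaded as described above.

```java
// Hypothetical CometUDF-style class illustrating per-task state in
// fields: one instance per (task attempt, class), reused across all
// batches of the task, dropped on task completion.
public class RowCountingUdf {
    private long rowsSeen = 0L; // per-task state; single-threaded access

    public long processBatch(int numRows) {
        rowsSeen += numRows; // accumulates across batches of one task
        return rowsSeen;
    }
}
```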

What changes are included in this PR?

  • Generic codegen infrastructure under org.apache.comet.codegen: CometBatchKernelCodegen (orchestrator) + CometBatchKernelCodegenInput / CometBatchKernelCodegenOutput (per-side emission) + CometBatchKernel Java base + CometInternalRow / CometArrayData / CometMapData shim bases + CometSpecializedGettersDispatch for shared get(ordinal, dataType) dispatch. The framework is generic over Catalyst expressions; today's only consumer is the ScalaUDF dispatcher.
  • ScalaUDF dispatcher under org.apache.comet.udf.codegen: CometScalaUDFCodegen (bridge entry, compile cache, per-partition kernel state).
  • Complex type support: ArrayType, StructType, and MapType as both input and output, including arbitrary nesting. Sealed ArrowColumnSpec plus recursive nested-class emission.
  • Optimization set applied per (expression, input schema): zero-copy UTF8 reads on VarCharVector, non-nullable isNullAt elision, decimal short-precision fast path on both sides, UTF8 on-heap write shortcut, pre-sized variable-length output buffers, NullIntolerant short-circuit, non-nullable output short-circuit, nullable-element elision on array / map writes, subexpression elimination. Complex-type output writes hoist getChildByOrdinal and cast to once-per-batch setup so the per-row body has no runtime type dispatch and no redundant casts. In-code TODOs flag three further optimizations the input side has and the output side does not yet (UTF8 inline-unsafe write, cached write-buffer addresses, nested var-width sizing).
  • Bridge instance cache: ConcurrentHashMap<Long, ConcurrentHashMap<String, CometUDF>> keyed by (taskAttemptId, className) with a TaskCompletionListener evicting the per-task entry. Invariant to Tokio work-stealing across batches: a task that migrates between workers still sees the same instance. Assertions on every invariant (single listener registration, non-null cache, reflective-instantiate success, TaskContext install effect).
  • Serde routing: CometScalaUDF routes any ScalaUDF whose tree passes CometBatchKernelCodegen.canHandle. Proto build is inlined; no other expressions adopt the dispatcher in this PR.
  • Allocation reuses Utils.toArrowField and Field.createVector for every output type. Input spec derives Spark DataTypes via Utils.fromArrowField. Exception paths close partially allocated vectors to avoid leaks. The Arrow Field is computed once per (expression, schema) cache entry rather than per batch.
  • User guide page docs/source/user-guide/latest/jvm_udf_dispatch.md covers the on/off config, supported and unsupported types, behavior notes, and the cross-query recompile caveat. Architecture lives in Scaladoc on CometScalaUDFCodegen and CometBatchKernelCodegen; in-code TODOs carry the open items.
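The bridge instance cache semantics above can be sketched as follows. Names are illustrative; the real cache lives in CometUdfBridge, looks up CometUDF subclasses, and evicts through a Spark TaskCompletionListener rather than an explicit call.

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the per-task instance cache: keyed by (taskAttemptId,
// className), reflective instantiation on first use, whole-task
// eviction. InstanceCacheSketch is a hypothetical stand-in.
public class InstanceCacheSketch {
    private static final ConcurrentHashMap<Long, ConcurrentHashMap<String, Object>> CACHE =
        new ConcurrentHashMap<>();

    public static Object getOrCreate(long taskAttemptId, String className) {
        ConcurrentHashMap<String, Object> perTask =
            CACHE.computeIfAbsent(taskAttemptId, id -> new ConcurrentHashMap<>());
        return perTask.computeIfAbsent(className, name -> {
            try {
                // Stands in for instantiating the CometUDF subclass.
                return Class.forName(name).getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException(e);
            }
        });
    }

    // Invoked from a TaskCompletionListener in the real code.
    public static void evict(long taskAttemptId) {
        CACHE.remove(taskAttemptId);
    }
}
```

A task that migrates between Tokio workers still keys into the same (taskAttemptId, className) slot, which is the work-stealing invariant described above.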

How are these changes tested?

  • CometCodegenSourceSuite: generated-source assertions for every optimization and every complex-type shape.
  • CometCodegenDispatchSmokeSuite: end-to-end correctness across the scalar and complex type surface (primitives, binary input and output, decimal precision boundaries, date / timestamp / timestampNTZ, array / struct / map round-trips including nested shapes and primitive-keyed maps), composed UDF trees, subquery reuse, TaskContext propagation, per-task cache isolation across sequential runs, kernel-cache reuse across batches of one query, ScalaUDF as a child of a native Spark expression.
  • CometCodegenDispatchFuzzSuite: randomized decimal identity fuzz at several null densities.
  • CometScalaUDFCompositionBenchmark: Spark vs Comet with the dispatcher enabled vs disabled, over three composed-UDF shapes.

@mbutrovich

There are like 4 Spark SQL test failures that look like they might need updating, but otherwise it's looking good. Not gonna worry about them until we discuss moving forward.

mbutrovich and others added 6 commits May 14, 2026 13:06
# Conflicts:
#	common/src/main/java/org/apache/comet/udf/CometUdfBridge.java
#	common/src/main/scala/org/apache/comet/udf/CometUDF.scala
#	docs/source/contributor-guide/index.md
#	native/core/src/execution/jni_api.rs
#	native/core/src/execution/planner.rs
#	native/spark-expr/src/jvm_udf/mod.rs
#	spark/src/main/scala/org/apache/comet/CometExecIterator.scala
#	spark/src/main/scala/org/apache/comet/serde/strings.scala
@mbutrovich mbutrovich changed the title feat: Arrow-direct codegen dispatcher for Spark expressions and Scala UDFs feat: add ScalaUDF support via a codegen dispatcher May 14, 2026
@mbutrovich mbutrovich moved this from Todo to In progress in Comet Development May 14, 2026